SAINS MALAYSIANA

Sains Malaysiana 53(7)(2024): 1715-1728

http://doi.org/10.17576/jsm-2024-5307-18

Machine Learning for Mapping and Forecasting Poverty in North Sumatera: A Data-Driven Approach

(Pembelajaran Mesin untuk Pemetaan dan Ramalan Kemiskinan di Sumatera Utara: Pendekatan Dipacu Data)

ARNITA*, FARIDAWATY MARPAUNG, FANNY RAMADHANI & DEWAN DINATA

Department of Mathematics, Universitas Negeri Medan, Jl. Williem Iskandar Pasar V, Medan, Indonesia

Received: 20 August 2023/Accepted: 13 May 2024

Abstract

Poverty warrants close study because it affects many facets of society, including socioeconomic disparity, crime, and limited access to high-quality education. North Sumatra is among the Indonesian provinces with the highest poverty rates. Reducing poverty effectively requires accurate, well-targeted data. This study therefore used machine learning (ML) to map and predict poverty in North Sumatra, with the aim of obtaining a detailed spatial distribution of poverty and a model of poverty in the province. Poverty mapping was conducted using the K-Means algorithm, and poverty prediction was conducted using the random forest (RF) algorithm. For the mapping, the elbow graph still showed a sharp decline in inertia at three and four clusters, while the silhouette index was higher for three clusters (0.313) than for four (0.244). Three poverty clusters (low, medium, and high) were therefore used in the model. For the prediction, grid search with cross-validation selected the best RF model, with the following parameters: number of estimators = 50, maximum depth = 10, minimum samples per split = 2, and minimum samples per leaf = 1. The RF model's predictions had a mean squared error (MSE) of 0.002617, indicating sufficiently high accuracy.

Keywords: Cross-validation; grid search; K-Means; poverty; random forest regression
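The cluster-count selection summarised in the abstract (inspect the inertia values behind the elbow graph, then compare silhouette indices) can be sketched as follows. This is a minimal standard-library Python illustration on made-up two-dimensional points, not the authors' code or data. The function names, the toy points, and the deterministic farthest-point seeding are all assumptions introduced for the example.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def maximin_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start at the
    first point, then repeatedly add the point farthest from all seeds."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def k_means(points, k, max_iter=100):
    """Lloyd's algorithm. Returns (labels, inertia)."""
    centroids = maximin_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    # Inertia: sum of squared distances from each point to its centroid.
    inertia = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, inertia


def silhouette_index(points, labels):
    """Mean silhouette width over all points (Rousseeuw 1987)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a width of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical toy data: three tight groups of five points each.
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
centres = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
points = [(cx + dx, cy + dy) for cx, cy in centres for dx, dy in offsets]

results = {}
for k in range(2, 6):
    labels, inertia = k_means(points, k)
    results[k] = (inertia, silhouette_index(points, labels))
    print(f"k={k}  inertia={inertia:.3f}  silhouette={results[k][1]:.3f}")

best_k = max(results, key=lambda k: results[k][1])
print("k with the highest silhouette index:", best_k)
```

On this toy data the inertia keeps falling as k grows, which is why inertia alone cannot settle the choice, while the silhouette index peaks at k = 3. The values reported in the abstract (0.313 and 0.244) come from the study's own poverty indicators and are not reproduced by this sketch.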

Abstrak

Isu kemiskinan merupakan isu penting untuk dibincangkan kerana kemiskinan mempengaruhi pelbagai aspek kehidupan seperti jurang sosio-ekonomi, jenayah serta akses yang terhad kepada pendidikan berkualiti. Sumatera Utara merupakan salah satu daripada 5 wilayah teratas dengan jumlah kemiskinan tertinggi di Indonesia. Suatu strategi diperlukan untuk mendapatkan maklumat kemiskinan yang tepat supaya pengurusan kemiskinan disasarkan dan berkesan. Oleh itu, pemetaan dan ramalan kemiskinan dijalankan bagi mendapatkan maklumat yang lebih terperinci tentang taburan reruang kemiskinan dan apakah model kemiskinan di Sumatera Utara. Pendekatan yang diambil untuk memetakan dan meramalkan kemiskinan di Sumatera Utara ialah dengan menggunakan pembelajaran mesin (ML). Pemetaan kemiskinan dijalankan dengan menggunakan algoritma K-Means, manakala ramalan kemiskinan dijalankan menggunakan algoritma hutan rawak (RF). Hasil yang diperoleh daripada pemetaan kemiskinan di Wilayah Sumatera Utara jika dilihat daripada graf siku menunjukkan graf tersebut masih mengalami penurunan nilai inersia yang mendadak pada kelompok ke-3 dan ke-4. Manakala jika dilihat dari nilai indeks Siluet, kelompok ke-3 adalah lebih tinggi daripada kelompok ke-4 dengan nilai indeks Siluet masing-masing adalah 0.313 dan 0.244. Maka dapat disimpulkan bahawa kluster kemiskinan yang digunakan ialah 3 dengan label rendah, sederhana dan tinggi. Manakala, hasil ramalan menggunakan algoritma hutan rawak dengan teknik keesahan silang carian grid memperoleh model terbaik dengan parameter n penganggar = 50, kedalaman maks = 10, min pecahan sampel = 2 dan min sampel daun = 1. Peramalan model RF menghasilkan ketepatan tinggi yang mencukupi dan Min Ralat Kuasa Dua (MSE) ialah 0.002617.

Kata kunci: Carian grid; keesahan silang; kemiskinan; K-Means; regresi hutan rawak

*Corresponding author; email: arnita@unimed.ac.id
